#multilingual dataset 15/05/2025
Ultra-FineWeb: A Trillion-Token Dataset Revolutionizing LLM Accuracy Across Languages
Tsinghua University and ModelBest released Ultra-FineWeb, a trillion-token multilingual dataset that significantly improves large language model accuracy through innovative data filtering.